[OPT] Tail Loop Optimization #1567

briannwu · 2025-01-17T14:43:46Z

details:

1. Separate tailLoopOpt for A / B: tailLoopOptA / tailLoopOptB.
2. Not supported: DTV, SparseGemm.
3. Reorder load instructions, using more VGPRs so that the loads can be issued back to back.

Compare:

globalReadMode = 2 -> use more VGPRs to reorder the global read (GR), s_waitcnt, and v_or_b32 instructions

Before:
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v[vgprG2LA+0+0], ..., 0 offen offset:0 // load one buffer value
/* g2l=0, load component 1 */
buffer_load_ubyte_d16 v0, ..., 0 offen offset:1 // load one buffer value
s_waitcnt vmcnt(0)
v_lshlrev_b32 v0, 0x8, v0 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+0], v[vgprG2LA+0+0], v0 // pack a sub 8-bit with dest
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v[vgprG2LA+0+4], ... offen offset:0 // load one buffer value
/* g2l=0, load component 1 */
buffer_load_ubyte_d16 v0, ... offen offset:1 // load one buffer value
s_waitcnt vmcnt(0)
v_lshlrev_b32 v0, 0x8, v0 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+4], v[vgprG2LA+0+4], v0 // pack a sub 8-bit with dest
...

After:
buffer_load_ubyte_d16 v[vgprG2LA+0+0], ... offen offset:0 // load one buffer value
buffer_load_ubyte_d16 v0, ..., 0 offen offset:1 // load one buffer value
buffer_load_ubyte_d16 v[vgprG2LA+0+4], ... offen offset:0 // load one buffer value
buffer_load_ubyte_d16 v1, ... offen offset:1 // load one buffer value
buffer_load_ubyte_d16 v[vgprG2LA+1+0], offen offset:0 // load one buffer value
...
s_waitcnt vmcnt(10)
v_lshlrev_b32 v0, 0x8, v0 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+0], v[vgprG2LA+0+0], v0 // pack a sub 8-bit with dest
s_waitcnt vmcnt(8)
v_lshlrev_b32 v1, 0x8, v1 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+4], v[vgprG2LA+0+4], v1 // pack a sub 8-bit with dest
...
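
The reordering above can be sketched as a small Python generator. This is only an illustration, not the KernelWriter.py implementation: the function name `reordered_tail_loads`, its parameters, and the shortened register names are hypothetical. It assumes the loads complete in issue order, so that after all loads are issued, pair `i` only needs `vmcnt(total_loads - 2 * (i + 1))`. The choice of six byte pairs is also an assumption, picked because it reproduces the `vmcnt(10)` and `vmcnt(8)` values in the listing; the listing itself elides the full load count.

```python
def reordered_tail_loads(dest_regs, first_tmp=0):
    """Emit a globalReadMode = 2 style schedule: all loads first, then the packs.

    dest_regs: one destination register name per byte pair (hypothetical names).
    first_tmp: index of the first temporary VGPR; pair i uses v{first_tmp + i}.
    """
    total_loads = 2 * len(dest_regs)
    lines = []
    # Phase 1: issue every load back to back. Each pair gets its own temp VGPR,
    # which is the "more vgprs" cost of this mode.
    for i, dest in enumerate(dest_regs):
        lines.append(f"buffer_load_ubyte_d16 {dest}, ... offen offset:0")
        lines.append(f"buffer_load_ubyte_d16 v{first_tmp + i}, ... offen offset:1")
    # Phase 2: pack in issue order. Pair i needs its two loads to have landed,
    # so at most total_loads - 2 * (i + 1) loads may still be outstanding.
    for i, dest in enumerate(dest_regs):
        tmp = f"v{first_tmp + i}"
        outstanding = total_loads - 2 * (i + 1)
        lines.append(f"s_waitcnt vmcnt({outstanding})")
        lines.append(f"v_lshlrev_b32 {tmp}, 0x8, {tmp}")
        lines.append(f"v_or_b32 {dest}, {dest}, {tmp}")
    return lines


# Six pairs is an assumed count; see the note above.
schedule = reordered_tail_loads([f"v[vgprG2LA+{i}]" for i in range(6)])
for line in schedule:
    print(line)
```

Compared with the "Before" sequence, only the final pack waits for `vmcnt(0)`; earlier packs overlap with loads that are still in flight.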

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////
globalReadMode = 3 -> use wider global load instructions
Before:
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v[vgprG2LB+0+0], ..., 0 offen offset:0 // load one buffer value
/* g2l=0, load component 1 */
buffer_load_ubyte_d16 v51, ..., 0 offen offset:1 // load one buffer value
/* g2l=0, load component 2 */
buffer_load_ubyte_d16_hi v52, ..., 0 offen offset:2 // load one buffer value
/* g2l=0, load component 3 */
buffer_load_ubyte_d16_hi v53, ..., 0 offen offset:3 // load one buffer value
...
s_waitcnt vmcnt(14)
v_lshlrev_b32 v51, 0x8, v51 // shift left to higher 8 bits
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v51 // pack a sub 8-bit with dest
s_waitcnt vmcnt(13)
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v52 // pack a sub 8-bit with dest
s_waitcnt vmcnt(12)
v_lshlrev_b32 v53, 0x8, v53 // shift left to higher 8 bits
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v53 // pack a sub 8-bit with dest
...
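
For readers less familiar with the d16 / d16_hi packing, here is a minimal Python model of how the four byte loads above end up in one 32-bit register. It is a sketch under one stated assumption: the half of each temporary register that the load does not write is zero. The helper names are hypothetical and exist only for this illustration.

```python
MASK32 = 0xFFFFFFFF


def load_ubyte_d16(byte):
    """Model of buffer_load_ubyte_d16: the byte lands in bits 0..7.

    Assumption: the upper 16 bits of the destination are zero.
    """
    return byte & 0xFF


def load_ubyte_d16_hi(byte):
    """Model of buffer_load_ubyte_d16_hi: the byte lands in bits 16..23.

    Assumption: the lower 16 bits of the destination are zero.
    """
    return (byte & 0xFF) << 16


def pack_dword(b0, b1, b2, b3):
    """Replay the shift/or sequence from the listing for one dword."""
    dest = load_ubyte_d16(b0)        # v[vgprG2LB+0+0]
    v51 = load_ubyte_d16(b1)
    v52 = load_ubyte_d16_hi(b2)
    v53 = load_ubyte_d16_hi(b3)
    v51 = (v51 << 8) & MASK32        # v_lshlrev_b32 v51, 0x8, v51
    dest |= v51                      # v_or_b32 dest, dest, v51
    dest |= v52                      # v_or_b32 dest, dest, v52 (already in bits 16..23)
    v53 = (v53 << 8) & MASK32        # v_lshlrev_b32 v53, 0x8, v53
    dest |= v53                      # v_or_b32 dest, dest, v53
    return dest


print(hex(pack_dword(0x11, 0x22, 0x33, 0x44)))
```

The result is the same value a single little-endian dword load of those four bytes would produce, which is why a wider load can replace the whole sequence when all four bytes are in bounds.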

After:
buffer_load_dwordx4 v[vgprG2LB+0:vgprG2LB+0+3], v[vgprGlobalReadOffsetB+0], s[sgprSrdB:sgprSrdB+3], 0 offen offset:0 // G -> Reg 0_0_0_0
... (calculate some data to determine how to load the last data)
label_LoadB:
... (jump to specified load tile)
label_LOAD_B0:
label_LOAD_B0_K1:
s_cmp_ge_u32 s11, 1
s_cbranch_scc0 label_MergeB
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v54, ... 0 offen offset:0 // load one buffer value
label_LOAD_B0_K2:
...
label_LOAD_B0_K15:
... (load code)
s_branch label_MergeB
label_MergeB:
... (jump to specified load tile)
label_MERGE_B0:
label_MERGE_B0_K1:
s_cmp_ge_u32 s11, 1
s_cbranch_scc0 label_CheckB_OOB
s_waitcnt vmcnt(0)
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v54 // pack a sub 8-bit with dest
label_MERGE_B0_K2:
...
label_MERGE_B0_K15:
... (pack code)
s_branch label_CheckB_OOB
label_CheckB_OOB:
...
label_CheckLoopBeginB:
... (calculate the size to be loaded and the size that can be loaded)
label_B0:
... (check whether another tile must be loaded again due to OOB)
s_cbranch_scc1 label_LoadB // Reload
s_branch label_CheckLoopBeginB // Re-check
label_TailGlobalLoadEnd:
s_waitcnt vmcnt(0)
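
The control flow above can be approximated in Python as follows. This is a loose model based on one reading of the listing, not the generated kernel: it assumes that `s11` holds the number of valid leftover bytes (1 to 15), that the wide load yields zeros for a row that is not fully in bounds, and it omits the OOB re-check loop entirely. None of these assumptions is confirmed by the PR text, and all function names are hypothetical.

```python
WIDTH = 16  # bytes covered by one buffer_load_dwordx4


def wide_load(memory, start):
    """Assumed behaviour: the whole row if fully in bounds, otherwise zeros."""
    if start + WIDTH <= len(memory):
        return bytearray(memory[start:start + WIDTH])
    return bytearray(WIDTH)


def load_with_tail(memory, start):
    """Wide load first, then byte loads and OR-merges for a partial row."""
    row = wide_load(memory, start)
    remaining = max(0, min(WIDTH, len(memory) - start))  # stands in for s11
    if remaining == WIDTH:
        return bytes(row)
    # label_LOAD_B0_K1 .. K15: fall through the chain while remaining >= k.
    temps = []
    for k in range(1, WIDTH):
        if not remaining >= k:       # s_cmp_ge_u32 s11, k / s_cbranch_scc0
            break
        temps.append(memory[start + k - 1])
    # label_MERGE_B0_K1 .. K15: OR each loaded byte into its slot.
    for k, value in enumerate(temps):
        row[k] |= value
    return bytes(row)


memory = bytes(range(1, 41))         # 40 bytes: two full rows plus 8 leftover
print(load_with_tail(memory, 0).hex())
print(load_with_tail(memory, 32).hex())
```

In this model only the last, partial row pays for the byte-wise path; every full row costs a single wide load.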

hcman2 · 2025-01-20T01:41:56Z

Any brief before/after comparison of the tail loop asm code?

briannwu · 2025-01-20T07:17:09Z

hcman2

Good optimization. It would be even better if you could share the performance gain for the sensitive sizes.


aazz44ss

LGTM

aazz44ss · 2025-02-06T01:44:08Z

tensilelite/Tensile/KernelWriter.py

+                               tailLoopOpt2nd == False) else 3
+
+      globalReadMode1st = 3 if tensorParameters1st["isSwizzled"] else globalReadMode1st
+      globalReadMode2nd = 3 if tensorParameters2nd["isSwizzled"] else globalReadMode2nd


Can you add comments here explaining what globalReadMode1st and globalReadMode2nd are?

commit 4a104fd91ba4ca243f7af26b4105ad1313280850
Author: briannwu
Date: Wed Feb 12 16:53:42 2025 +0800

[OPT] Tail Loop Optimization (#1567)

details:
1. Separate tailLoopOpt for A / B: tailLoopOptA / tailLoopOptB.
2. Not supported: DTV, SparseGemm.
3. Reorder load instructions with more vgprs.
cases for swizzle-A * Restore part of tail loop optimization for swizzle-A * Re-enabled optimization of tail loop for swizzle-A * Fixed segfault when using GuardPageBack for swizzle-A * Removed size hacks for swizzle-{A, B} * Removed guard-K of A for tail loop of swizzle-A kernel commit 48b21d267a2797b01d37e8769a808776bc50b665 Author: Josh Chang <[email protected]> Date: Mon Jan 20 09:39:53 2025 +0800 Optimize preloop by v_lshl_add (#1564) commit 01839feac5d36dbfe4970c82250ecb6d2af4f6ed Author: David Dixon <[email protected]> Date: Fri Jan 17 11:57:51 2025 -0700 Factor out argument parsing in TensileCreateLibrary (#1514) * Make TensileCreateLibrary a directory and add TCL specific ParseArguments module * Add exported functions to module * Add reasonable defaults * Update cmake support * Update docs string commit 38efb62a7bad945dc41535e4ce9fe0f5e3530996 Author: Ethan <[email protected]> Date: Fri Jan 17 12:16:59 2025 +0800 code-gen: Allowed WaveGroups be distributed along n-dim for DTVA/SwizzledA * Allow WaveGroup in N-dim for swizzledA * directly modify totalElementsCoal/PerpA * restore assertion and rejection * Added DTVB & updated pytests * Fix TLU=True case commit 0357bb47e653a1d37bfdfd1101a391fb12d04243 Author: smalekta <[email protected]> Date: Thu Jan 16 17:02:20 2025 -0500 added equality tuning for F8HS_TN (#1554) commit 1be8e65306ce9834e7c361bd1ab140e9da9fe122 Author: Braden Stefanuk <[email protected]> Date: Thu Jan 16 09:40:38 2025 -0700 Fix default code object version (#1553) commit 15c618e244b58a2a5f16c5cb63add600ed61c805 Author: Henry Ho <[email protected]> Date: Thu Jan 16 22:49:21 2025 +0900 gfx942 BBS F8B8BS F8BS equality tuning (#1551) * fix BBS/HHS NN equality yaml * equality tuning BBS F8B8BS F8BS commit cf68a1b63638a2b3d9ebe748b56bdb0e37e056ea Author: Josh Chang <[email protected]> Date: Thu Jan 16 16:35:50 2025 +0800 fix type in install.sh (#1560) Co-authored-by: root <[email protected]> commit eb00cc86c1101eeaa07d03e2e120c2fe507183e3 Author: 
Josh Chang <[email protected]> Date: Thu Jan 16 16:20:20 2025 +0800 Install msgpack dependency for CentOS8 (#1559) Co-authored-by: root <[email protected]> commit e5426e879d529c6e76b7eba77e96b6275a17139a Author: Josh Chang <[email protected]> Date: Thu Jan 16 11:44:54 2025 +0800 Check destination folder with yaml attribute while merging (#1555) 1. Only check Equality and GridBased folders 2. Can't prevent manually changed yaml attribute commit 0ac3cb18b1f7bd98eb0569f777778fa7bb697421 Author: KKyang <[email protected]> Date: Thu Jan 16 09:42:23 2025 +0800 Use B64 instead of B32 (#1548) commit 305a144636584e08b39e114c5af6c513fa4b69fa Author: hcman2 <[email protected]> Date: Wed Jan 15 18:38:03 2025 +0800 Update BBS NN/NT/TN Equality yamls. (#1549) commit bb0cce368f5f86d40169d40cb7fe2090ae8dd0e7 Author: jichang <[email protected]> Date: Mon Jan 13 06:04:22 2025 +0000 hipblaslt-bench: only print device caps of target device commit ce603f2fb3286953388f4a7d71f4fcb67ec20d1f Author: KKyang <[email protected]> Date: Tue Jan 14 13:06:04 2025 +0800 Fix if returned index exceeds 32-bit in TensileLite client (#1538) commit 579164a8f2bd6f9ebc7161a7e8bb7c9c598cc593 Author: Henry Ho <[email protected]> Date: Mon Jan 13 08:30:00 2025 +0000 xf32 TN/NT/equality bmm gridbased update commit d8edc091eddcd5ec6af80ef54f9e0abc01e0cf3b Author: Josh Chang <[email protected]> Date: Tue Jan 14 10:46:56 2025 +0800 Equality Tuning: F16 F32 NN TN TT (#1542) commit e01241ab96209555bdfab3e70e00e039dd81a99e Author: Jinp800125 <[email protected]> Date: Tue Jan 14 10:34:51 2025 +0800 SyncUp Gridebase HHS from BBS (#1504) Co-authored-by: victorwu <[email protected]> commit 67f4da618ec51de724df283c0af0caa0cc567f06 Author: wencchen <[email protected]> Date: Wed Jan 8 22:17:09 2025 -0500 fix frequency monitor commit 7e0e4be3045dd075af0d4f4893cca49267658f9a Author: George Tseng <[email protected]> Date: Mon Jan 13 11:52:05 2025 +0800 Add emulation smoke/regression/extended tests. 
(#1533) * Update smoke/regression tests. * Update smoke_gtest.yaml * Add extended tests. commit 7d25c31fd2b386f38b4d99910b456561c11afc1a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri Jan 10 17:09:07 2025 -0700 Bump rocm-docs-core from 1.12.1 to 1.13.0 in /docs/sphinx (#1522) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.1 to 1.13.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.1...v1.13.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 94dc2948da4e9602918a15e899401afb9594d2fe Author: David Dixon <[email protected]> Date: Fri Jan 10 16:12:56 2025 -0700 Feature/refactor compile and link fxns (#1507) * add toolchain files * Remove build commands * Improvements to build algorithm * Don't build duplicates * Copyright commit c77a900089a891e82739e95e7b66f85134dc237e Author: wenchuanchen <[email protected]> Date: Fri Jan 10 22:34:59 2025 +0800 fix f32 mac ldspad=-1 (#1530) commit 1a2b9f0bdf34be40432bda244c5f06b0762ec7d5 Author: Henry Ho <[email protected]> Date: Wed Jan 8 11:46:52 2025 +0800 fix incorrect gridbased logic commit 7b9972cdc97ea553a6e159fe267faf8995efc62a Author: hcman2 <[email protected]> Date: Fri Jan 10 14:53:11 2025 +0800 Update NN NT TN Equality yamls. 
(#1532) commit 5ff31e7e72dfc5f8ccb7aeb7bc2ad4f652309049 Author: AndySu12 <[email protected]> Date: Fri Jan 10 14:52:01 2025 +0800 Update gfx942 F8B8BS/F8BS TN Equality yamls for models (#1534) commit 2b26942ad8ffbf37969efcfa148cdbe3698aebb2 Author: KKyang <[email protected]> Date: Fri Jan 10 11:22:54 2025 +0800 Use v_pk_mul_f32 for short store alpha calculations (#1524) commit 2dc6dacffc8ca8159485a18ab13f28d6ecc7d2b6 Author: hcman2 <[email protected]> Date: Thu Jan 9 11:02:16 2025 +0800 Update TN F8B8BS equality yaml. (#1527) commit cd917e0b7cf6944b241d6cb9545d22547d06107f Author: Parth Kumar <[email protected]> Date: Wed Jan 8 19:09:06 2025 -0600 Added kernels for HHS BBS (#1473) commit 9b2ff15b1d7d03214f3894067902921faf11a57c Author: Serge Lu <[email protected]> Date: Wed Jan 8 18:17:33 2025 +0800 Added multiple devices support for matrix transform (#1338) * Added multiple devices support for matrix transform * Restore selected device for multi-device test of matrix transform commit 62597c0a057f3c6c4d3dcdb4e415c2e306ef3109 Author: Josh Chang <[email protected]> Date: Wed Jan 8 14:02:55 2025 +0800 gfx942_80cu BBS NN NT Tuning Release (#1526) 1. Large K and large N cases commit 14b2226b0281d9d6c72dfa15e8a89a1dcdf4c4d0 Author: KKyang <[email protected]> Date: Wed Jan 8 09:42:58 2025 +0800 Use v_pk_mul_f32 for long store alpha multiplications (#1517) commit d61fc8c333c636fcf3b77c52d1e59e49e274d350 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue Jan 7 12:29:52 2025 +0800 Bump rocm-docs-core from 1.11.0 to 1.12.1 in /docs/sphinx (#1511) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.1. 
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 78ec86229999aea66e86d70cf17ec1f6a11d0f87 Author: jichangjichang <[email protected]> Date: Tue Jan 7 11:52:34 2025 +0800 update lib version as 0.13 (#1508) commit e03b1e964d8f0774fe49d176096b418f93b577c8 Author: Jeffrey Novotny <[email protected]> Date: Mon Jan 6 22:20:22 2025 -0500 Update license file for 2025 (#1519) * Update license file for 2025 * Add current year as end date instead of changing the current year in copyright commit d4c158fdf1eb990fb1f12f5c5172b72c6619c048 Author: hcman2 <[email protected]> Date: Tue Jan 7 08:29:31 2025 +0800 Update TN F8BS Equality yaml. 
(#1518) commit 422b5e06d472aea54d6d9238ed51670bdf5ef845 Author: David Dixon <[email protected]> Date: Mon Jan 6 09:42:17 2025 -0700 Remove merge files options from cmakelists (#1513) commit bafc3108fcc5d3793c115ea8d748a82e0168078d Author: KKyang <[email protected]> Date: Mon Jan 6 15:53:49 2025 +0800 Create README.md for TensileLite (#1466) commit c1034284f4fd43b4adbc587d1adaabde68fa3989 Author: AndySu12 <[email protected]> Date: Mon Jan 6 15:28:54 2025 +0800 Update gfx942 BBS/F8F8S NN/NT/TN Equality yamls for models (#1510) commit 7ad3a47ed3898c60d438399130ca62e1e8ad4751 Author: KKyang <[email protected]> Date: Mon Jan 6 15:23:44 2025 +0800 Update BBS NN/NT/TN equality tuning for gfx942_80cu (#1467) commit 79d146710dd05045e6436df5408f43fa5227454a Author: victorwu <[email protected]> Date: Thu Jan 2 05:51:31 2025 +0000 TF32 NN NT TN commit e4d2bc48af711ac31dcc8575b3e6534458b3dc28 Author: KKyang <[email protected]> Date: Mon Jan 6 10:17:04 2025 +0800 Add BBS support for find_exact.py (#1465) commit d6449b6f17868f0d5801651cc85e237c345f017b Author: hcman2 <[email protected]> Date: Fri Jan 3 15:26:00 2025 +0800 Update TN BBS Equality yaml. (#1505) commit 0aa831f0bc756dddda44656aa1360e2136256c7b Author: George Tseng <[email protected]> Date: Fri Jan 3 14:25:23 2025 +0800 Disable non-supported features for DTVA. (#1501) * Disable non-supported features for DTVA. * Update for tail loop cases. * Update to UseSgprForGRO: [0]. 
commit 1b0b1c5a3c5582897dc3ef13f0dae5795c819aab Author: David Dixon <[email protected]> Date: Thu Jan 2 17:07:52 2025 -0700 Remove merge-files option (#1407) * Remove merge-files option * Update yaml files in tests * remove merge files from build commands commit 95b1d7bfede4ee808b26249cf54012ba6ac7c915 Author: cm chen <[email protected]> Date: Wed Dec 18 04:42:15 2024 -0800 gfx12 - enlarging tolerance is not needed commit 964b4f3a17691d054518c21a974160cb8e192ca6 Author: hcman2 <[email protected]> Date: Thu Jan 2 09:59:51 2025 +0800 Update NN/NT BBS Equality yamls. (#1498) commit 582a9889af6513c6eb813b8df5e5983b01cea892 Author: jichang <[email protected]> Date: Mon Dec 30 10:03:55 2024 +0000 update changelog commit 5400c61b7ece89d83c964e08ce491c462d481f54 Author: jichang <[email protected]> Date: Mon Dec 30 07:00:40 2024 +0000 Fix: deps folder is incorrect commit 4d6dc03f410ee1f2d7ee32f0e2e6ad61d5e4a294 Author: Daine McNiven <[email protected]> Date: Mon Dec 30 09:01:07 2024 -0700 Add reject states for failing streamk params (#1425) * Add reject states for failing streamk params. * Add scheuldeIteralg == 1 as valid for streamk. commit 3d5765fd0b24d215f8d762aeffdd6f855081e276 Author: AndySu12 <[email protected]> Date: Fri Dec 27 23:58:02 2024 +0800 Update gfx942 HHS/BBS/F8BS/F8B8BS NN/NT/TN Equality/GridBased yamls for models (#1499) commit a9667ff6dfaf5a571b9f7e4e589067db1111971d Author: Daine McNiven <[email protected]> Date: Fri Dec 27 04:54:01 2024 -0700 Support conjugate-transpose as equivalent to transpose (#1429) * Support conjugate-transpose * Convert conj-transpose to transpose during problem construction. * ammendment to last commit * Remove unneeded use of arg.trans in aux tests. 
commit cbb0c2ded50b6c0f0ae7ed8e217c653414a44fde Author: Hao-Sheng Chen <[email protected]> Date: Fri Dec 27 15:20:29 2024 +0000 Modify the tests to separate different hardware behavior commit 0ebbb877d7ed792b99bd5ab32632ce52551966b2 Author: Hao-Sheng Chen <[email protected]> Date: Fri Dec 27 12:23:31 2024 +0000 Fix incorrect verification method commit 3bdace8bc0035d2104ed3bc7c9d7258f6465dc03 Author: Hao-Sheng Chen <[email protected]> Date: Tue Dec 24 11:56:10 2024 +0000 Fix AMDClangVersion isn't assigned before running Tensile commit 098f9e0c45fcc9e09e397be5111b90fb7778b331 Author: Hao-Sheng Chen <[email protected]> Date: Tue Dec 24 07:05:06 2024 +0000 Fix the incorrect relative build path when it's not at the root level commit de4c20447136f2b4940ca25224e0bf2fc68f89e9 Author: Hao-Sheng Chen <[email protected]> Date: Tue Dec 17 08:54:45 2024 +0000 Fix garbage value of bias_type commit 260c9ceffe3e77e9ea5413022a4c8aee29780049 Author: Serge Lu <[email protected]> Date: Fri Dec 27 18:24:57 2024 +0800 Fixed random failures in MBSK post-loop (#1497) * Fixed missing srd increment for MBSK kernel * Re-enabled large sizes test cases for swizzle-A * Removed unnecessary changes commit 4311f2d4af1bce6f857c8e0fbcf3bee7c807e0f2 Author: Jinp800125 <[email protected]> Date: Fri Dec 27 14:14:58 2024 +0800 update 942 7 range F8HS Gridebase (#1476) Co-authored-by: victorwu <[email protected]> commit 4f1215afc895ceec63a8d584429f5efe0ac57f27 Author: hcman2 <[email protected]> Date: Fri Dec 27 04:46:42 2024 +0800 Remove some Equality problems. (#1494) Remove the sizes that gridbased is better. 
commit d633749f107dbddd03f3a262adc484b6a5197e04 Author: briannwu <[email protected]> Date: Thu Dec 26 16:32:58 2024 +0800 Using i32*u32->i64 when removing staggerU offset (#1487) Fix: memory access fault in tail loop with extreme larger size commit 2b8c9d8c604a530fd9316f8d4cb7af663aa0bed5 Author: tsenwang <[email protected]> Date: Wed Dec 25 08:08:01 2024 +0000 cmake option commit c883079a47aea8316cafaff4f142ad6d2250debe Author: jichang <[email protected]> Date: Tue Dec 24 09:57:25 2024 +0000 Fix : seg fault due to overflow in Uint32 Remainder calculation commit 336aa6967910c8d0cca029986ee0781dbd0380e5 Author: hcman2 <[email protected]> Date: Thu Dec 26 00:29:31 2024 +0800 Update TN input F8/F8B8/B8F8 and output F8/B8/B/H/S logic yaml. (#1489) 1. modify the incorrect kernel name of aquavanjaram_Cijk_Alik_Bljk_F8B8BS_BH_BiasSB_HAS_SAB_SAV_UserArgs and aquavanjaram_Cijk_Alik_Bljk_F8BS_BH_BiasSB_HAS_SAB_SAV_UserArgs 2. Update and add other yamls. commit 26eba0f556c7cf4c7ecb976e4cd31b25f592340d Author: Josh Chang <[email protected]> Date: Wed Dec 25 17:13:28 2024 +0800 Fix returnAlgoCount typo (#1492) commit 1893cdf267ec77bdcb12c149963bf34e63d85ff6 Author: Jinp800125 <[email protected]> Date: Wed Dec 25 09:45:05 2024 +0800 Update F8B8BS and F8BS TN BBS NN, NT, TN Gridebase (#1486) Co-authored-by: victorwu <[email protected]> commit 5997a6c7f72b575778b45d600ab07eb17628e6f1 Author: hcman2 <[email protected]> Date: Tue Dec 24 20:22:36 2024 +0800 Update F8B8BS and F8BS TN logic yamls. (#1482) Update 2 logic yamls from row-wise logic yaml. commit 473d638e2b568774aeb7609073828a3af0f66205 Author: jichang <[email protected]> Date: Mon Dec 23 10:37:29 2024 +0000 Fix: seg fault when execute some solutions with extreme larger size commit 3ef4fb26ca098ec0d5b4df68b7151fd71b59c229 Author: hcman2 <[email protected]> Date: Tue Dec 24 10:33:06 2024 +0800 Update F8F8S logic yaml and enable GSU+F8 (#1479) 1. Remove the rejection of GSU+F8 output. 2. 
Update aquavanjaram_Cijk_Alik_Bljk_F8F8S_BH_BiasSH_HAS_SAB_SCD_SAV_UserArgs.yaml commit 61c3ba5ab7f8b5aede8bc678548db05e656c1181 Author: mahmoodw <[email protected]> Date: Mon Dec 23 10:39:25 2024 -0700 Changes to exclude StreamK by default (#1471) Co-authored-by: mahmoodw <[email protected]> commit 09ba034b3148a25f4b876e1a2f9192ab585530f8 Author: hcman2 <[email protected]> Date: Mon Dec 23 10:04:20 2024 +0800 Use archVGPR when accVGPR is not enough. (#1460) This PR is to support larger MT such as 256x320. Generally, we only have 256 accVGPRs. If MT is larger than 256x256, we need some extra archVGPR to store the acc results. commit 0c8494a3d569a996330c15fac4dde9ff8ce8691a Author: AndySu12 <[email protected]> Date: Mon Dec 23 03:10:20 2024 +0800 Update gfx942 F8F8S TN Equality/GridBased yamls for torch._scaled_mm fix (#1477) commit 0579107e30a2ef5b7970e288a20f21e81ac559fe Author: smalekta <[email protected]> Date: Fri Dec 20 15:29:30 2024 -0500 Gfx942 80cu grid based and equality tuning for HHS NN (#1474) * adding point to fix the grid before tuning * added point to HHS_NN grid for 80cu to make the grid denser * adding a size equality HHS_NN for 80cu commit f5fd734754526c0e5398bf6a92cfb26c1ab103bf Author: Feroz <[email protected]> Date: Fri Dec 20 08:54:39 2024 -0600 Aquavanjaram 20CU equality GEMMM tuning updates for TF32 NN and TN data type (#1470) * Aquavanjaram 20CU equality NN TN TF32 GEMM sizes tuned commit 5ca877f6aeccdba1aba99235fef2bb1be62a6768 Author: Ethan <[email protected]> Date: Fri Dec 20 12:41:33 2024 +0800 code-gen: improved tail loop and edge tile of swizzled A * Opt swizzleA tail-loop and minor bug fix * Add test cases for tail loop * disable big sizes commit 6b413e322e57660a7c6803d59a321bd96530521c Author: Hao-Sheng Chen <[email protected]> Date: Fri Dec 20 10:30:10 2024 +0800 Fix incorrect type casting for alpha and beta in f16 compute type (#1462) * Fix incorrect type casting for alpha and beta in f16 compute type commit 
73a04a0b69ea27e87542317d9fff556a140c2045 Author: mengzcai <[email protected]> Date: Tue Dec 10 17:59:50 2024 +0000 Support UseSgprForGRO for dtva commit 322d6b10308ba88b6552847a9cd5759bf049b168 Author: George Tseng <[email protected]> Date: Thu Dec 19 11:47:18 2024 +0800 Update gfx942 NT/TN/NN FP8/BF8/BF16 Equality (#1463) * update 48 Equality logic yaml sizes. * update bbs logic yamls. commit bffef5e46d9acc80db5daa6d0c4579dfbd772b85 Author: Josh Chang <[email protected]> Date: Wed Dec 18 16:20:22 2024 +0800 gfx942_80cu BBS NN NT TN Tuning Release (#1459) 1. Large K and large N cases commit 856fec23bb11d12e382273badf67fbe1bf6cfda3 Author: Hao-Sheng Chen <[email protected]> Date: Wed Dec 18 15:59:04 2024 +0800 Revert "Fix incorrect type casting for alpha and beta in f16 compute type" This reverts commit 3600099416461e108e5e14c6e49e30acfb66c9e9. commit aa3a2ec5e572cb7c0868640fcb4ca263a8340499 Author: NaveenElumalaiAMD <[email protected]> Date: Tue Dec 17 10:52:14 2024 -0700 Fix invalid string printed when running hipblaslt-test (#1428) * Fix invalid printed when running hipblaslt-test * return -1 for bad values commit 3600099416461e108e5e14c6e49e30acfb66c9e9 Author: Hao-Sheng Chen <[email protected]> Date: Sun Dec 15 20:33:57 2024 +0000 Fix incorrect type casting for alpha and beta in f16 compute type commit 4d40e360f2a6b82888451e2669a456ba1717932a Author: hcman2 <[email protected]> Date: Tue Dec 17 10:39:21 2024 +0800 Update solutions for hstu bmm 3 sizes. 
(#1453) commit b61a09590ee8f437055086f29156c704a68f6385 Author: KKyang <[email protected]> Date: Tue Dec 17 10:38:48 2024 +0800 Update gfx942_80cu NT/TN/NN f16/f32 Equality (#1452) commit 83930f22138831b276ffaa8f0d39075fe425a268 Author: Mark Kim <[email protected]> Date: Mon Dec 16 12:54:23 2024 -0600 gfx942 38cu HSS/BSS NN TN NT grid tune (#1448) commit b089e406cdcc469de1fa829ec6926ea530f32791 Author: Mark Kim <[email protected]> Date: Mon Dec 16 09:04:18 2024 -0600 gfx942 38cu SGEMM NN TN NT grid tune (#1439) commit ba818b0e7faffa76587bd8ab0ad175c44fb0d68f Author: Mark Kim <[email protected]> Date: Mon Dec 16 09:04:02 2024 -0600 gfx942 38cu F8HS NN TN NT grid tune (#1440) commit 71295a0e9f888938be6f9c8939ac80a24752b23c Author: Vin Huang <[email protected]> Date: Sat Dec 14 13:34:32 2024 +0000 Refactory Sparse Test in Tensilelite commit 41767d9554163403ff9c54d5f07aa2b0d14d110a Author: KKyang <[email protected]> Date: Mon Dec 16 13:46:04 2024 +0800 [Hotfix] correct occupancy calculation (#1451) commit 0ed9795b44579c98218bcdf20cf3f41aa5e622b9 Author: David Dixon <[email protected]> Date: Sat Dec 14 16:06:17 2024 +0000 glod for yaml and co in client writer commit b263c6bf45a99cdd374b77798114f6f08413d5c0 Author: David Dixon <[email protected]> Date: Fri Dec 13 17:31:55 2024 +0000 remove warning commit b79b0320b56b8924093811386287cfee9d5d6a74 Author: David Dixon <[email protected]> Date: Fri Dec 13 17:16:19 2024 +0000 newline at eof commit 0e5b7bf679b770c775a479c5db1759c2be70f101 Author: David Dixon <[email protected]> Date: Fri Dec 13 17:14:29 2024 +0000 remove := usage for py36 and add LibraryClient test commit 6529544fd206a665c7758e19e578847ab9b83398 Author: David Dixon <[email protected]> Date: Fri Dec 13 13:36:39 2024 +0000 Patch client writer commit 4654ea4e160746882bbd915ce189f4786db7a916 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri Dec 13 17:29:49 2024 -0700 Bump rocm-docs-core from 1.10.0 to 1.11.0 in /docs/sphinx 
(#1427) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.10.0 to 1.11.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.10.0...v1.11.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 545cc7ecbfcfebea013026a2a2e96b022f11fae5 Author: David Dixon <[email protected]> Date: Fri Dec 13 11:52:56 2024 -0700 Remove functions for generating file paths (#1402) * Remove functions for building paths to files and validation commit d1c7b59227cdf56b2bdf472d8b90f1415fe15782 Author: AndySu12 <[email protected]> Date: Fri Dec 13 19:26:27 2024 +0800 Update gfx942 BBS NN/NT/TN Equality/GridBased yamls for 1204 bbs bmm dynamic (#1446) commit 88fef9953dba5e0fbce8c106ab69f30482467f90 Author: KKyang <[email protected]> Date: Fri Dec 13 10:28:38 2024 +0800 Enable setOccupancyLimit for MBSK (#1438) * Refactor MBSK related functions * FIx vgpr occupancy not calculated correctly in unified mode * Support setOccupancyLimit for MBSK commit e453aebb2899e11534e6d0c45a7114e755838b83 Author: Braden Stefanuk <[email protected]> Date: Thu Dec 12 08:51:42 2024 -0700 Verify toolchain components at program invocation (#1413) * feat: toolchain components not globally accessible * feat: update version acquisition * chore: remove extraneous compiler verifications commit 8824d1acc92dc5bc3558638ade5dcb27ced8060b Author: Parth Kumar <[email protected]> Date: Thu Dec 12 08:54:31 2024 -0600 Tune Aquavanjaram942X F8F8 and F8BF8 TN for equality (#1430) * added F8F8 F8BF8 kernels * removed underperforming sizes * review comment fix commit e13e13351a8167fefec51974786f3822a6e38281 Author: hcman2 <[email 
protected]> Date: Thu Dec 12 10:36:01 2024 +0800 LSU supports larger MT and reuse LDS. (#1433) 1. Move LSU into LSU.py. 2. Do partial LSU when the LDS is not enough. commit ebd940f8e0351ae212280f2eb60f7d2717d4f72b Author: David Dixon <[email protected]> Date: Wed Dec 11 11:02:44 2024 -0700 Remove manifest functionality (#1401) * Remove manifest functionality * Remove manifest usage from cmake and client writer commit 573675edba6e5f0405f3033667194dcf731d2e80 Author: KKyang <[email protected]> Date: Wed Dec 11 14:03:07 2024 +0800 Support gridbased kdtree search for batched gemm (#1417) commit 82d40441735326edc918c066191eabb8ea03c82e Author: Feroz <[email protected]> Date: Tue Dec 10 21:35:14 2024 -0600 Tune aquavanjaram942 20CU equality TF32 NN GEMM (#1432) commit c92591ad11eb5c2e38f7a92f43626e13fbe714f6 Author: Henry Ho <[email protected]> Date: Thu Dec 5 09:24:03 2024 +0800 remove code owner commit ec597a6b5132c3573cbcfc8dd29a04620699a991 Author: Henry Ho <[email protected]> Date: Tue Nov 5 09:58:58 2024 +0800 extend MBSK WS to 40MB commit a62fb4e879da0028b5b6d51166dc8b084b0e5343 Author: Braden Stefanuk <[email protected]> Date: Tue Dec 10 09:12:08 2024 -0700 Remove unsupported assignment expression operator (#1434) commit 6190ac0ce61e45fbbc240b2fb38c31f3d1049144 Author: David Dixon <[email protected]> Date: Tue Dec 10 09:00:31 2024 -0700 Add alternative tool for incremental builds when tuning (#1431) * Add alternative tool for incremental builds when tunng * Ensure rebuild occurs when files are removed * Compute ARCH from co file and use ROCM_PATH * Simplify wavefrontsize commit dd826190a36b6a2cad5504fe774cdf8160835234 Author: David Dixon <[email protected]> Date: Tue Dec 10 05:26:04 2024 -0700 Reduce/clean up TCL output (#1426) * Reduce/clean up TCL output commit 79fdf7e2304729ed526940acaa373a315e100cde Author: David Dixon <[email protected]> Date: Mon Dec 9 06:30:03 2024 -0700 Remove unused write functions (#1400) commit 20590e11ab919357eab329fae27d9729617f4f63 
Author: David Dixon <[email protected]> Date: Mon Dec 9 06:29:27 2024 -0700 Remove PackageLibrary option (#1367) * Remove PackageLibrary option commit 1ef6b6dafb00cea5f7edf77f6fdac53ce56a5fe5 Author: David Dixon <[email protected]> Date: Mon Dec 9 06:28:48 2024 -0700 Removes client config option (#1404) * Removes client config option * Remove unused global commit 7e7fcfee815e9f234b7f1ab8665b390755c25657 Author: wencchen <[email protected]> Date: Mon Dec 2 07:33:48 2024 +0000 add f32 equality logic yamls commit 4b13967503b2658afcc0d84f1f7e6da23e923736 Author: KKyang <[email protected]> Date: Mon Dec 9 13:59:18 2024 +0800 Fix cpuThreads == 0 not working properly (#1337) commit a457413b9a10c42f205408997547c9a7480bd1d3 Author: briannwu <[email protected]> Date: Mon Dec 9 11:34:14 2024 +0800 [OPT] Optimize tail loop (#1353) Use buffer_load_dword instructions instead of buffer_load_d16 If tailLoopOpt is False, original method(GuardK) will be used. Not supported: WaveSeparateGlobalRead1/2, DTV, LocalSplitU, SparseGemm commit c1f9582f7ca4aa5e60ff7ac91710eed1e0fdb4fc Author: Braden Stefanuk <[email protected]> Date: Fri Dec 6 15:20:12 2024 -0700 Code object compression via bundling (#1374) * feat: compress code objects * feat: add --no-compress flag commit d43d84a8ce849de9494f70221884bfb3a8fd7b9e Author: David Dixon <[email protected]> Date: Fri Dec 6 06:33:54 2024 -0700 Remove client/benchmark write functions (#1398) commit 8f687dba4b8e4d320f3a9507563364c12991a0cd Author: David Dixon <[email protected]> Date: Fri Dec 6 06:33:20 2024 -0700 Remove embed library option (#1403) commit a20a7b674bdf7c9d23f176b25507a46b13abbfa9 Author: wencchen <[email protected]> Date: Tue Nov 19 18:22:34 2024 +0800 f32 mac bug fix commit 71fc08c5862868ba10a2c2b30415b50860218476 Author: Hao-Sheng Chen <[email protected]> Date: Tue Dec 3 07:26:36 2024 +0000 Add f8, i8 and f16 logic yaml on gfx12 commit 7bf065b2a33b7e108fbe7296c8ebdb6c4e292d13 Author: Feroz <[email protected]> Date: Thu Dec 5 
17:52:24 2024 -0600 Aquavanjaram942 20CU Tune HHS NN and TN GEMM sizes for equality and grid lib (#1390) Co-authored-by: Feroz <[email protected]> commit d16c2049ecf4ed9ede0173c463ce9b15b056a69f Author: David Dixon <[email protected]> Date: Thu Dec 5 08:25:50 2024 -0700 Remove write cmake function (#1397) commit 5fb2a4f9168aaab184396f08a3d74da36162ce84 Author: Alex Brown <[email protected]> Date: Thu Dec 5 07:39:14 2024 -0700 Add initial optional stream-k libraries (#1347) commit 3f8ae669ae53bc8fb674275128395f84787a850f Author: Alex Brown <[email protected]> Date: Thu Dec 5 07:38:41 2024 -0700 Fix stream-k kernels when K==0, allow beta*C calculation (#1399) commit 8ddfcabd8dc8b720ec9a4f3f9c3c57ada85d311d Author: Alex Brown <[email protected]> Date: Wed Dec 4 09:07:46 2024 -0700 Stream-k batch predict (#1410) * Add batch to grid predictor, and modify calculation of fixup peers * Disable fixup peers change for now, requires coefficient changes commit 663de3e1fbcc9d80f8db92e69eaf3090069d7a6e Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Dec 2 11:45:33 2024 +0000 Bump rocm-docs-core from 1.9.2 to 1.10.0 in /docs/sphinx Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.10.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.10.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <[email protected]> commit b25be168c0dc1013e5d59c79fb964f338b215e3c Author: jichang <[email protected]> Date: Mon Dec 2 05:48:59 2024 +0000 tensilelite: executing asm-new.sh can be in different folder commit 52c9825671cec110bb2de3bff3c73c3c14df3aec Author: Feroz <[email protected]> Date: Tue Dec 3 10:12:47 2024 -0600 Tune Aldebaran BF16 NN TN NT GEMM sizes (#1323) * Tune Aldebaran BF16 NN TN NT GEMM sizes commit d7dc5cc4d6b741e4770c6f3bf44fe2ecc52820ac Author: Serge Lu <[email protected]> Date: Tue Dec 3 07:16:23 2024 +0000 Temporarily remove some test cases for swizzle-A commit 233894ebc00be0be784e3591e05df36e94fa1d57 Author: who who who <[email protected]> Date: Tue Dec 3 15:14:34 2024 +0800 Fp8 tuning upstream (#1380) * add fp8 tuning config * fix bugs * fix arch name for mi308 and mi210 * add scale and bias * add full mi * fix bugs * fix bugs * add activation pattern * fix a bug * fix merge bug when inData has no soltion * limit wave size less or equal than 4 * update readme * fix full stage * update pattern to match latest hipblaslt log * fix activation * update log example * apply full stage in gridbase tuning * fix dulplicate problem size introduced by solution index * pass unsupported gemm config * calculate gsu * fix a bug * fix a bug * skip slow kernels commit f52a366cd4dd80ea2ea32706b954b231f987c836 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.gith…

briannwu requested review from jichangjichang, KKyang, vin-huang, imcarsonliao, hcman2, Serge45, Jinp800125, TonyYHsieh and solaslin as code owners January 17, 2025 14:43

briannwu force-pushed the tail_opt branch from 38f62d5 to 0cab91f Compare January 17, 2025 14:45

briannwu force-pushed the tail_opt branch 4 times, most recently from ef4242e to 4b4f883 Compare January 20, 2025 07:58

hcman2 previously approved these changes Jan 20, 2025

briannwu dismissed hcman2’s stale review via 05902a0 January 20, 2025 13:04

briannwu force-pushed the tail_opt branch from 4b4f883 to 05902a0 Compare January 20, 2025 13:04

[OPT] Tail Loop Optimization

62c84ce

details:
1. Separate tailLoopOpt for A / B: tailLoopOptA / tailLoopOptB.
2. Not supported: DTV, SparseGemm.
3. Reorder load instructions with more vgprs.
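
The reordering in item 3 can be illustrated with a small sketch. This is not the Tensile code generator; the function name, the register names, and the use of one temporary per destination (`v0`, `v1`, ...) are hypothetical, and the `...` operands are left elided as in the PR description. It assumes loads return in issue order, so `s_waitcnt vmcnt(n)` guarantees that all but the newest `n` loads have completed.

```python
def emit_packed_byte_loads(dests, reorder):
    """Return assembly-like lines that load two bytes per destination and pack them.

    dests   -- destination register names (hypothetical strings)
    reorder -- False: load, wait, and pack per destination, sharing one temp (v0)
               True:  issue every load first with one temp per destination,
                      then wait only as far as each pack step requires
    """
    lines = []
    if not reorder:
        for dest in dests:
            lines.append(f"buffer_load_ubyte_d16 {dest}, ... offen offset:0")
            lines.append("buffer_load_ubyte_d16 v0, ... offen offset:1")
            lines.append("s_waitcnt vmcnt(0)")  # full stall before every pack
            lines.append("v_lshlrev_b32 v0, 0x8, v0")
            lines.append(f"v_or_b32 {dest}, {dest}, v0")
        return lines

    total_loads = 2 * len(dests)
    # Phase 1: issue all loads back to back; each pair gets its own temp vgpr.
    for i, dest in enumerate(dests):
        lines.append(f"buffer_load_ubyte_d16 {dest}, ... offen offset:0")
        lines.append(f"buffer_load_ubyte_d16 v{i}, ... offen offset:1")
    # Phase 2: pair i needs only the first 2*(i+1) loads to have landed.
    for i, dest in enumerate(dests):
        still_in_flight = total_loads - 2 * (i + 1)
        lines.append(f"s_waitcnt vmcnt({still_in_flight})")
        lines.append(f"v_lshlrev_b32 v{i}, 0x8, v{i}")
        lines.append(f"v_or_b32 {dest}, {dest}, v{i}")
    return lines


# Six hypothetical destinations, i.e. twelve byte loads in total.
DESTS = [f"v[vgprG2LA+{i}]" for i in range(6)]
BEFORE = emit_packed_byte_loads(DESTS, reorder=False)
AFTER = emit_packed_byte_loads(DESTS, reorder=True)
```

With twelve loads in flight, the first two waits in `AFTER` are `vmcnt(10)` and `vmcnt(8)`, the same pattern as the "After" listing in the description. The instruction count is unchanged; the cost is one extra temporary vgpr per destination in exchange for overlapping the load latency.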

briannwu force-pushed the tail_opt branch from 05902a0 to 62c84ce Compare January 21, 2025 06:22

aazz44ss self-requested a review February 5, 2025 09:11

aazz44ss approved these changes Feb 6, 2025

hcman2 approved these changes Feb 12, 2025

briannwu merged commit 4a104fd into ROCm:develop Feb 12, 2025
14 checks passed


hcman2 commented Jan 20, 2025

briannwu commented Jan 20, 2025

hcman2 left a comment

aazz44ss left a comment
